In [1]:
from time import time
start = time()
In [31]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.express as px
from sklearn.neighbors import KernelDensity
import rpy2.robjects as robjects
In [56]:
# Bright seaborn plots
sns.set_theme(palette="bright")

# load R extension to jupyter-lab cells
# (useful to combine Python and R code, but R is limited here; use RStudio instead
# if you are only going to use R)
%load_ext rpy2.ipython
The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

Confidence Intervals (CI) and Bootstrapping¶

A confidence interval is a range of values representing how confident we are about some parameter (e.g. mean, mode, median, etc.) of the population based only on a sample.

Usually when working with data we only have a sample of the population rather than the whole population. However, we are able to estimate a range of possible values of a parameter (generally the mean, $\mu$) of the population if we have a representative sample. The central limit theorem states that the distribution of sample means approximates a normal distribution and, as you may already know, approximately 95% of the data is covered by going two (more accurately, 1.96) standard deviations ($\sigma$) to the right and left of the mean.

Taking the means of random samples of the population and retrieving 95% of that distribution (a normal distribution, because of the central limit theorem) gives us the confidence interval of the true population mean with a confidence of 95% based on the sample; the larger the sample size, the better the confidence interval obtained. We can also obtain other confidence intervals, for example a 99% confidence interval by taking approximately 2.58$\sigma$ to the right and left of the distribution of the means.

If we have multiple samples, the confidence interval can be computed by taking the means of the samples and then retrieving the lower limit of the parameter as the 0.025 quantile and the upper limit as the 0.975 quantile; the range of values between those quantiles encompasses 95% of all the means. But if we have only one sample, we can calculate the standard error (SE), which estimates the standard deviation of the means of multiple samples of the population, and multiply it by 1.96:

$$ lower\ limit = \mu - 1.96 \cdot SE $$
$$ upper\ limit = \mu + 1.96 \cdot SE $$

The general formula to calculate the CI is also written as:

$$ CI = \mu \pm z \cdot \frac{s}{\sqrt{n}} $$
$$ z = critical\ value\ for\ the\ confidence\ level;\ \ s = standard\ deviation\ of\ the\ sample;\ \ n = sample\ size $$

Calculating the CI as shown above assumes your data is normally distributed, which is not always the case. In that situation, we can take advantage of a powerful method called bootstrapping.

Bootstrapping is basically resampling your sample, computing virtually any parameter of interest that tries to describe the population parameter. The resampling is done by taking a sample of your sample of the same size but allowing repetition; for each resample the parameter is calculated and, from there, several methods have been devised to calculate the CI: empirical bootstrapping (also known here as basic or reverse), normal, percentile, etc.

Thus, for example, normal bootstrapping uses the definition of the CI for normal data (see above), because the parameters obtained from the resamples follow a normal distribution, substituting $\mu$ with $\mu^*$, the mean of the resampled parameter of interest, and $SE$ with $SE^*$, again computed from the resamples.

On the other hand, the empirical method defines the CI for the mean as:

$$ CI = \left(2\overline{x} - x^*_{1 - \frac{\alpha}{2}},\ 2\overline{x} - x^*_{\frac{\alpha}{2}}\right) $$
$$ where\ \overline{x}\ is\ the\ mean\ of\ the\ original\ data\ (the\ sample) $$
$$ and\ x^*_{q}\ is\ the\ mean\ of\ the\ resamples\ found\ at\ percentile\ q,\ for\ q = 1 - \frac{\alpha}{2}\ and\ q = \frac{\alpha}{2} $$
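The normal bootstrap mentioned above is not implemented in the cells that follow, so here is a minimal sketch of it (the function name `normal_bootstrap_ci` and the seed are my own choices, not part of any library):

```python
import numpy as np

def normal_bootstrap_ci(data, tests=10000, z=1.96, seed=0):
    # Normal bootstrap: mu* +/- z * SE*, where mu* is the mean of the
    # resampled means and SE* is their standard deviation
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    means = np.array([rng.choice(data, data.size, replace=True).mean()
                      for _ in range(tests)])
    return means.mean() - z * means.std(), means.mean() + z * means.std()

sample = np.random.default_rng(42).normal(10, 1, 250)
low, high = normal_bootstrap_ci(sample)
```

The interval is symmetric around the mean of the bootstrap means, which is essentially the sample mean, so it behaves like the SE-based CI above but without needing a closed-form SE.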
In [4]:
# Let's create random normal data to illustrate the confidence interval
norm_pop = np.random.normal(10, 1, 100000)

print(f"True mean population: {norm_pop.mean()}")
True mean population: 9.99892386569881
In [5]:
# Now let's get the confidence interval with a confidence of 95% based on only one random sample

def ShowCI95(population:"array-like", sample_size:int):
    
    '''
    Show on screen the 95% confidence interval of the mean
    
    This function prints on screen the range of values of the 95% confidence interval based on a 
    sample of the population provided
    
    Parameters
    ----------
    
    population: array-like
        array-like representing the population where a sample of size sample_size will be taken
        
    sample_size: int
        size of the sample
            
    
    Examples
    ----------
    
    >>>import numpy as np
    >>>import pandas as pd
    >>>data = np.random.normal(0,1,10000)
    >>>ShowCI95(data, 1000)
    Sample size: 1000
    Sample mean: -0.006450809541726166
    True mean population is between -0.06841125689868024 and 0.055509637815227914 with a confidence of 95%
    The sample mean has an aprox error of -0.006 +/- 0.062
    
    
    '''

    norm_sam = np.random.choice(population, sample_size)
    
    # sem() is a pandas function to get the SE
    error = pd.Series(norm_sam).sem() * 1.96
    mu = norm_sam.mean()
    
    l_inf = mu - error
    l_sup = mu + error

    print(f"Sample size: {norm_sam.shape[0]}")
    print(f"Sample mean: {norm_sam.mean()}")
    print(f"True mean population is between {l_inf} and {l_sup} with a confidence of 95%")
    print(f"The sample mean has an aprox error of {round(mu, 3)} +/- {round(error, 3)}")
    
ShowCI95(norm_pop, 5000)
Sample size: 5000
Sample mean: 9.999882072842729
True mean population is between 9.971736335678909 and 10.028027810006549 with a confidence of 95%
The sample mean has an aprox error of 10.0 +/- 0.028
In [6]:
# Now let's try to find out the confidence interval using random sampling of the population
# Every sample will have 5000 data points and 100 samples will be drawn

print(f"True population mean {norm_pop.mean()}")

means = []

for experiment in range(100):
    mu = np.random.choice(norm_pop, 5000).mean()
    means.append(mu)

means = pd.Series(means)
    
print(f"The mean of the sample means is {means.mean()}")

# If we take the percentiles covering the 95 percent of the means obtained, we are getting the confidence
# intervals with a confidence of 95%
P_025 = means.quantile(0.025)
P_975 = means.quantile(0.975)

print(f"The 95% confidence interval range from the value {P_025} to {P_975}")
True population mean 9.99892386569881
The mean of the sample means is 9.998894739238308
The 95% confidence interval range from the value 9.975312085880674 to 10.02294945027279
In [7]:
# The error increases as the sample data size decreases

for size in [10000,5000,2500,1000,500,250,125]:
    ShowCI95(norm_pop, size)
    print("")
Sample size: 10000
Sample mean: 10.005574696966415
True mean population is between 9.98607221306244 and 10.025077180870388 with a confidence of 95%
The sample mean has an aprox error of 10.006 +/- 0.02

Sample size: 5000
Sample mean: 10.006389840895665
True mean population is between 9.97897845183047 and 10.03380122996086 with a confidence of 95%
The sample mean has an aprox error of 10.006 +/- 0.027

Sample size: 2500
Sample mean: 9.927657132062514
True mean population is between 9.888139515400704 and 9.967174748724323 with a confidence of 95%
The sample mean has an aprox error of 9.928 +/- 0.04

Sample size: 1000
Sample mean: 9.938950538840226
True mean population is between 9.874657825927429 and 10.003243251753023 with a confidence of 95%
The sample mean has an aprox error of 9.939 +/- 0.064

Sample size: 500
Sample mean: 9.99482637010372
True mean population is between 9.909101829156397 and 10.080550911051043 with a confidence of 95%
The sample mean has an aprox error of 9.995 +/- 0.086

Sample size: 250
Sample mean: 9.910705896216525
True mean population is between 9.784947006278523 and 10.036464786154527 with a confidence of 95%
The sample mean has an aprox error of 9.911 +/- 0.126

Sample size: 125
Sample mean: 10.124699080548902
True mean population is between 9.962122625960186 and 10.287275535137619 with a confidence of 95%
The sample mean has an aprox error of 10.125 +/- 0.163

In [8]:
# The mean of the sample means is pretty much the same as the population mean but, in reality, we wouldn't know that
# Now let's see the population distribution and the distribution of the sample means

# Population plot
fig, (ax_1, ax_2) = plt.subplots(1,2, figsize=(17, 8))

sns.kdeplot(norm_pop, linewidth=2, ax=ax_1)
ax_1.axvline(norm_pop.mean(), c="black", linestyle="--", label="$\mu$")
ax_1.set_title("\nPopulation distribution\n", fontdict={"size":17})
ax_1.legend(fontsize=14)

# Means plot
sns.kdeplot(means, linewidth=2, ax=ax_2)

# Mean and 95% of the data
ax_2.axvline(means.mean(), c="black", linestyle="--", label="$\mu$")
ax_2.axvline(P_025, c="orange", linestyle="dotted", linewidth=3, label="$CI$")
ax_2.axvline(P_975, c="orange", linestyle="dotted", linewidth=3)

X_fill, Y_fill = ax_2.get_lines()[0].get_data()
mask = (X_fill >= P_025) & (X_fill <= P_975)
ax_2.fill_between(X_fill[mask], Y_fill[mask], color="orange", alpha=0.7, label="$95\%\ CI$")

ax_2.set_title("\nDistribution of means (sample means)\n", fontdict={"size":17})
ax_2.legend(fontsize=14)

plt.subplots_adjust(wspace=0.1)
plt.show()
In [9]:
# Now let's bootstrap with the basic and percentile methods to illustrate its power on non-normal data
# We are going to create a Bootstrapping class just because; take your time to decode the Python class below

class Bootstrapping:
    
    '''
    Bootstrapping by resampling the sample
    
    This class performs two types of bootstrapping based on the mean of the sample and returns
    the lower and upper limits of the confidence interval of the mean. The methods available are
    basic bootstrapping, also known as the Reverse Percentile Interval or empirical bootstrapping,
    and percentile bootstrapping
    
    Parameters
    ----------
    
    data: array-like
        sample to be resampled
    
    tests: int, default 10000
        integer indicating how many times the sample will be resampled
        
        
    Returns
    -------
        tuple: tuple with the lower and upper limits of the confidence interval
    
    
    Examples
    --------
    
    >>>from numpy.random import normal
    >>>sample = normal(0, 1, 1000)
    >>>boot = Bootstrapping(sample, tests=1000)
    >>>boot.basic()
    (-0.044194517957389545, 0.07898892888826403)
    
    '''
    
    def __init__(self, data:"array-like", tests:int=10000):
        
        self.data  = np.array(data).reshape(-1,)
        self.tests = tests
        
        self.mean = self.data.mean()
        self.std  = self.data.std()
    
    def __percentileLimits(self, means, lower, upper):
        return (np.percentile(means, lower), np.percentile(means, upper))
        
    def __bootstrap(self, sample):
        # Resampling the sample and computing the mean of each resample
        # This is basically bootstrapping
        means = []
        for _ in range(self.tests):
            mu = np.random.choice(sample, sample.shape[0], replace=True).mean()
            means.append(mu)
        
        return np.array(means).reshape(-1,)
        
    def basic(self, confidence=95):
        
        '''
        Get the confidence intervals based on the definition:

        CI = (low = 2X - x_{1 - a/2}, high = 2X - x_{a/2})
            where "X" is the sample/original mean, "x" is the mean found in the {1 - a/2} or 
            {a/2} percentile of the sample resampled and "a" means alpha, the value of the tail
            of the normal distribution
        
        '''
        
        means = self.__bootstrap(self.data)
        
        alpha = (100 - confidence) / 2
        lower_limit, upper_limit = self.__percentileLimits(means, alpha, 100 - alpha)
        
        return (2 * self.mean - upper_limit, 2 * self.mean - lower_limit)
    
    def percentile(self, confidence=95):
        
        '''
        Get the confidence intervals based on the definition:

        CI = (low = x_{a/2}, high = x_{1 - a/2})
            where x" is the mean found in the {1 - a/2} or {a/2} percentile of the
            sample resampled and "a" means alpha, the value of the tail of the normal
            distribution
        
        '''
        
        means = self.__bootstrap(self.data)
        alpha = (100 - confidence) / 2
        return self.__percentileLimits(means, alpha, 100 - alpha)
In [10]:
# Here are our non-normal data population and sample (no resamples of the sample yet)

exp_pop = np.random.exponential(10, 5000)
exp_sam = np.random.choice(exp_pop ,250)

# Plotting using plotly to better visualize the means (interactive, but no preview on github)
fig = ff.create_distplot([exp_pop, exp_sam], ["Population","Sample"], show_hist=False,
                         curve_type='kde', colors=["steelblue","orange"])

fig.add_vline(exp_pop.mean(), annotation={"text":"$\mu$"},
              annotation_position="top right", line_color="steelblue",
              line_dash="dash")

fig.add_vline(exp_sam.mean(), annotation={"text":"$\mu$"}, 
              annotation_position="top left", line_color="orange",
              line_dash="dot")

fig.update_layout(
    margin=dict(l=20, r=20, t=15, b=15),
    width=860,
    height=500,
    template="plotly_dark"
)

fig.show()
In [11]:
# Now let's get the CI using our class and resampling 10000 times (default value of the class)
boots = Bootstrapping(exp_sam)

# Basic and percentile with confidence of 95%
Confidence = 95
low_b, high_b = boots.basic(confidence=Confidence)
low_p, high_p = boots.percentile(confidence=Confidence)

print(f"\nTrue population mean:{exp_pop.mean()}")
print("\nCI USING BASIC AND PERCENTILE BOOTSTRAPPING")
print("\nBasic bootstrapping:")
print(f"\tThe {Confidence}% confidence interval range from the value {low_b} to {high_b}")
print("\nPercentile bootstrapping:")
print(f"\tThe {Confidence}% confidence interval range from the value {low_p} to {high_p}")
True population mean:9.88262809918132

CI USING BASIC AND PERCENTILE BOOTSTRAPPING

Basic bootstrapping:
	The 95% confidence interval range from the value 8.414255435709055 to 10.752011086961195

Percentile bootstrapping:
	The 95% confidence interval range from the value 8.496958528561187 to 10.810982384481969
In [12]:
# Let's plot the results
data_dict = {}
data_dict['category'] = ['Basic','Percentile']
data_dict['lower'] = [low_b, low_p]
data_dict['upper'] = [high_b, high_p]
dataset = pd.DataFrame(data_dict)

plt.figure(figsize=(15,7))
y_axis = np.arange(0, dataset.shape[0] / 2, 0.5)
for lower, upper,y in zip(dataset['lower'], dataset['upper'], y_axis):
    plt.plot((lower,upper),(y,y),'o-', color="orange", linewidth=3)

plt.axvline(exp_pop.mean(), linestyle="--", c="steelblue", label=f"Population $\mu={round(exp_pop.mean(),2)}$")
plt.axvline(exp_sam.mean(), linestyle="dotted", c="green", label=f"Sample $\mu={round(exp_sam.mean(),2)}$")

plt.ylim(-0.1, 1)
plt.xlim( exp_sam.mean() - (exp_sam.mean() / 5), exp_sam.mean() + (exp_sam.mean() / 5) )
plt.yticks(y_axis, list(dataset['category']), fontsize=15)
plt.legend(fontsize=12)
plt.title("\nConfidence interval (95%) of Percentile\nand Basic bootstrapping (orange line)\n", fontdict={"size":17})
plt.show()
In [13]:
# We don't need to code our own functions/classes to calculate the CI using the bootstrap. scipy already has an
# implementation for basic, percentile and another one called BCa (which is a little bit more complicated to code)

boots_types = {"percentile":np.nan, "basic":np.nan, "bca":np.nan}

for key in boots_types:
    b = stats.bootstrap((exp_sam,), np.mean, confidence_level=0.95, method=key)
    boots_types[key] = b.confidence_interval

boots_types
Out[13]:
{'percentile': ConfidenceInterval(low=8.503408565386506, high=10.809201549049465),
 'basic': ConfidenceInterval(low=8.455806682698904, high=10.74773487091856),
 'bca': ConfidenceInterval(low=8.550763081545766, high=10.84050814242377)}
In [14]:
# Let's plot the results of scipy

b = boots_types
data_dict = {}
data_dict['category'] = ['percentile', 'basic', 'bca']
data_dict['lower'] = [b["percentile"].low, b["basic"].low, b["bca"].low]
data_dict['upper'] = [b["percentile"].high, b["basic"].high, b["bca"].high]
dataset = pd.DataFrame(data_dict)

plt.figure(figsize=(15,7))
y_axis = np.arange(0, dataset.shape[0] / 2, 0.5)
for lower, upper,y in zip(dataset['lower'], dataset['upper'], y_axis):
    plt.plot((lower,upper),(y,y),'o-', color="orange", linewidth=3)

plt.axvline(exp_pop.mean(), linestyle="--", c="steelblue", label=f"Population $\mu={round(exp_pop.mean(),2)}$")
plt.axvline(exp_sam.mean(), linestyle="dotted", c="green", label=f"Sample $\mu={round(exp_sam.mean(),2)}$")

plt.ylim(-0.1, 1.1)
plt.xlim( exp_sam.mean() - (exp_sam.mean() / 5), exp_sam.mean() + (exp_sam.mean() / 5) )
plt.yticks(y_axis, list(dataset['category']), fontsize=15)
plt.legend(loc="center right", fontsize=12, bbox_to_anchor=(0.985, 0.7))
plt.title("\nConfidence interval (95%) of percentile,\nbasic and BCa bootstrapping (orange lines)\n", fontdict={"size":17})
plt.show()

P-value¶

A p-value is the probability of obtaining observations as extreme as, or more extreme than, your data point. This translates to the area under the curve of the tails of your data distribution (see the plot below).

This p-value ($p$) is used to reject, or fail to reject, a null hypothesis. Any hypothesis that is not the null is simply called the alternative. To decide, an alpha ($\alpha$) value is chosen as a cutoff for the probability at which an observation is considered extraordinary. This $\alpha$ is usually set to 0.05, although in reality it tends to be an arbitrary value. Depending on the "error" one is willing to accept, $\alpha$ can vary, for example being stricter with a value of 0.001 (accepting one "error" per 1000 observations) or looser with 0.2 (one "error" per 5 cases). Thus, with $p < \alpha$ the null hypothesis can be rejected and we go with the alternative hypothesis; otherwise, with $p > \alpha$, we do not have enough evidence to reject the null hypothesis.

On the other hand, how the heck are those hypotheses determined? Well, several "definitions" of the null hypothesis can be found, such as "what is accepted by default", "whatever is known or already defined", "what we do not want to test", or, more formally, "the hypothesis that proposes that no statistical significance exists in a set of given observations", while the alternative is "what we are interested in knowing". The reality is that these are somewhat ambiguous, and personal criteria can certainly bias the choice of hypothesis and lead to the wrong one (as anyone may do, including me). In fact, choosing which hypothesis should be the null and which the alternative is a much more complex process and will not be discussed in detail here; a more descriptive article about this can be found elsewhere. In practice, many statistical tests have already defined their null hypothesis and, therefore, the alternative.

In [15]:
# Let's get normal data and point out those "extraordinary" observations
normal = np.random.normal(10, 5, 5000)

# tails based on alpha = 0.05, or just 0.025 for each tail (right and left)
alpha_tail = 0.05 / 2
left_tail = np.quantile(normal, alpha_tail)
right_tail= np.quantile(normal, 1 - alpha_tail)
In [16]:
# Plotting distribution and tails

plt.figure(figsize=(12,5))
axis = sns.kdeplot(normal)
axis.axvline(left_tail, linestyle="dotted", color="green", label=r"$\frac{\alpha}{2}$")
axis.axvline(right_tail,linestyle="dotted", color="green")

# values equal to or lower than alpha
X,Y = axis.get_lines()[0].get_data()
mask_left = (X <= left_tail)
mask_right= (X >= right_tail)

axis.fill_between(x=X[mask_left], y1=Y[mask_left], color="green", alpha=0.7, label="Extraordinary observations")
axis.fill_between(x=X[mask_right], y1=Y[mask_right], color="green", alpha=0.7)

axis.legend(fontsize=9)
plt.title(f"\nNormal distribution with $\\alpha$ = {alpha_tail * 2}\n")
plt.show()
In [17]:
# USING THE EXAMPLE ABOVE (PLOT)....
# The p_value is the probability of getting observations as rare as, or rarer than, your data point. For example,
# the probability of getting a value of x=5 or rarer means getting a value of 5, or 4, or 3, etc. Likewise,
# it also encompasses getting values of 15, or 16, or 17, etc., because values of x greater than 15 have (roughly)
# the same probability as a value of 5, and we also take into account stranger observations

adjusted_curve = stats.norm(loc=normal.mean(), scale=normal.std())

# Probability of getting equal or rarer values of x=5 (cumulative density function of 5 of only one tail)
x = 5
p_value_left = adjusted_curve.cdf(x)

# The symmetric nature of the normal distribution makes it simple to get the probability of the right tail
p_value = p_value_left * 2

print(f"\nThe p-value of getting the same or rarer observation of our value chosen '{x}' is {p_value}\n")
The p-value of getting the same or rarer observation of our value chosen '5' is 0.3151290712605498

In [18]:
# As you saw above, a probability of 0.3 is not such a rare thing, it's pretty common, therefore our value of x=5
# is not as special as you might think. There are even stranger observations

# Getting the exact symmetric value of x on the right tail
right_tail_value_percentile = 100 - stats.percentileofscore(normal, x)
right_tail_value = np.percentile(normal, right_tail_value_percentile)

# Estimating the density of the plot because seaborn sometimes is not as good as sklearn
# This is useful to fill under the curve in the plot below
normal_copy = normal.reshape(-1,1)
samples_x_left  = np.arange(min(normal_copy), x, 0.1).reshape(-1, 1)
samples_x_right = np.arange(right_tail_value, max(normal_copy), 0.1).reshape(-1, 1)

kernel_density_object = KernelDensity().fit(normal_copy)
samples_y_left  = np.exp(kernel_density_object.score_samples(samples_x_left))
samples_y_right = np.exp(kernel_density_object.score_samples(samples_x_right))

# Plotting distribution and tails

plt.figure(figsize=(12,5))
axis = sns.kdeplot(normal_copy)
axis.axvline(x, linestyle="dotted", color="green", linewidth=3)
axis.axvline(right_tail_value, linestyle="dotted", color="green", linewidth=3)

# values equal or lower than x (in this case it was 5)
axis.fill_between(x=samples_x_left.reshape(-1,), y1=samples_y_left.reshape(-1,),
                  color="green", alpha=0.7,
                  label=f"$P\_value={round(p_value, 3)}$")
axis.fill_between(x=samples_x_right.reshape(-1,), y1=samples_y_right.reshape(-1,),
                  color="green", alpha=0.7)

axis.legend(fontsize=9)
plt.title(f"\nValues with probabilities equal or lower with $x={x};\ p\_value$\n")
plt.show()

Parametric tests¶

Parametric tests are those statistical tests that make assumptions about the parameters of the population distribution (usually that it is normally distributed) from which the sample is drawn. In this category we find tests such as:

  • T-test
  • Z-test
  • ANOVA
  • ANCOVA
  • MANOVA
  • MANCOVA

Every test shown above is used in different scenarios according to the data used, the sample size and the purpose of the research. It's also worth noting that every test produces a statistic which needs to be compared to the distribution of the respective test to get the right p-value and, therefore, reject or not the null hypothesis.

T-test¶

This parametric test is used to compare the means of two groups (it determines whether the means of the two groups are equal).

The t-test, or Student's t-test, is used when the standard deviation of the population is unknown and/or the sample size is lower than 30 (rule of thumb). The t-test assumes the following about your data:

  1. The sample data have been randomly sampled from a population.
  2. There is homogeneity of variance (i.e., the variability of the data in each group is similar).
  3. The distribution is approximately normal.

The second point above is not strictly required, because there is a t-test for samples with unequal variances. In this way, there are three t-tests trying to answer different questions.

  • Paired t-test (within-subject design): A paired t-test is used if the groups come from the same population (e.g., measurements before and after an experimental treatment on one group)

  • Two-sample or independent t-test (between-subjects design): A two-sample t-test is used if the two groups come from different populations (e.g., two species, or two groups of people from different cities). The independent t-test has two elemental variants: an independent t-test for samples with equal size and variance, and an independent t-test for samples with equal or unequal sample sizes and unequal variances (also called Welch's test)

  • One-sample t-test: If we only have one group that is compared to a single value, we use a one-sample t-test.

On the other hand, we also need to remember that these tests can be applied to one-tailed and two-tailed experiments.

Paired t-test

$$ T = \frac{mean_1 - mean_2}{\frac{s_{diff}}{\sqrt{n}}} $$
$$ s_{diff} = standard\ deviation\ of\ the\ paired\ differences\ of\ the\ samples $$
$$ degrees\ of\ freedom:\ df = n - 1 $$

One sample T-test

$$ T = \frac{\overline{x} - \mu_0}{\frac{s}{\sqrt{n}}} $$
$$ \overline{x} = mean\ of\ the\ sample $$
$$ \mu_0 = value\ to\ compare $$
$$ s = standard\ deviation\ of\ the\ sample $$
$$ df = n - 1 $$
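The one-sample formula can be checked by hand against `stats.ttest_1samp` (the data values below are made up purely for illustration):

```python
import numpy as np
import scipy.stats as stats

x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3])
mu_0 = 10.0                                     # value to compare against

# T statistic and two-tailed p-value from the formula above, df = n - 1
t = (x.mean() - mu_0) / (x.std(ddof=1) / np.sqrt(x.size))
p = 2 * stats.t.sf(abs(t), df=x.size - 1)

res = stats.ttest_1samp(x, mu_0)                # scipy gives the same numbers
```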

Independent T-test or two-samples T-test

Equal variances

$$ T = \frac{\overline{x_1} - \overline{x_2}}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$
$$ S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$
$$ s_1; s_2 = standard\ deviations\ of\ the\ samples $$
$$ df = n_1 + n_2 - 2 $$

Unequal sizes and unequal variances (also called Welch's t-test)

$$ T = \frac{\overline{x_1} - \overline{x_2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} $$
$$ s_1; s_2 = standard\ deviations\ of\ the\ samples $$
$$ df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2} $$

You probably noticed something called degrees of freedom. In a nutshell, the shape of the distribution curve changes as the sample size decreases: the tails become longer and wider and the peak lower. The curve of our statistic therefore has to be adjusted according to the sample size, and that is what the degrees of freedom are for.
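As a sanity check, Welch's formula above can be computed by hand and compared with scipy's `equal_var=False` implementation (the simulated samples and seed are my own, chosen to loosely resemble the iris sepal lengths used later):

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
a = rng.normal(5.0, 0.35, 30)   # setosa-like sepal lengths (simulated)
b = rng.normal(6.6, 0.64, 30)   # virginica-like sepal lengths (simulated)

s1, s2 = a.var(ddof=1), b.var(ddof=1)
n1, n2 = a.size, b.size

# Welch's T statistic and its degrees of freedom, following the formulas above
t = (a.mean() - b.mean()) / np.sqrt(s1 / n1 + s2 / n2)
df = (s1 / n1 + s2 / n2) ** 2 / (
    (s1 / n1) ** 2 / (n1 - 1) + (s2 / n2) ** 2 / (n2 - 1))
p = 2 * stats.t.sf(abs(t), df)  # two-tailed p-value

res = stats.ttest_ind(a, b, equal_var=False)  # should match the manual numbers
```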

F-test¶

I mentioned earlier that in order to perform a t-test there must be homogeneity of variance. An F-test is a popular test to assess it.

The F-test is usually (but not exclusively) used to compare the variances of two variables, and it is sensitive to non-normal data. The null hypothesis states equal variances for the two samples, $var_1 = var_2$, and the two-tailed alternative hypothesis is $var_1 \ne var_2$.

$$ F = \frac{\sigma_1^2}{\sigma_2^2}\ \ \ \ \ \ \ \sigma_1^2;\sigma_2^2 = variances\ of\ the\ population\ or\ the\ samples $$

Depending on whether the test is two-tailed, right-tailed or left-tailed, the order used to determine $\sigma_n^2$ varies. For instance, in a two-tailed or right-tailed test the larger of the two sample variances is set as the numerator, whereas in a left-tailed test the smaller variance becomes the numerator.

Other commonly used tests to check for homogeneity of variances (sometimes even referred to as "more powerful than the F-test") are Levene's test and Bartlett's test. These are not going to be discussed in more detail here, but it's good to know that they exist.

In [19]:
# Loading dataset
iris_df = sns.load_dataset("iris")
iris_df.sample(10)
Out[19]:
sepal_length sepal_width petal_length petal_width species
36 5.5 3.5 1.3 0.2 setosa
40 5.0 3.5 1.3 0.3 setosa
8 4.4 2.9 1.4 0.2 setosa
89 5.5 2.5 4.0 1.3 versicolor
141 6.9 3.1 5.1 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
95 5.7 3.0 4.2 1.2 versicolor
12 4.8 3.0 1.4 0.1 setosa
49 5.0 3.3 1.4 0.2 setosa
76 6.8 2.8 4.8 1.4 versicolor
In [20]:
# Filtering samples of setosa and virginica for sepal_length
setosa_df = iris_df.loc[iris_df["species"] == "setosa", :].sepal_length
virginica_df = iris_df.loc[iris_df["species"] == "virginica", :].sepal_length

# Sampling 30 observations to test if the means between the two species are differents (using independent t-test)
sample_size = 30
setosa_sample = setosa_df.sample(sample_size, random_state=42)
virginica_sample = virginica_df.sample(sample_size, random_state=42)
In [21]:
plt.figure(figsize=(10,5))
sns.kdeplot(setosa_sample, label="setosa", linewidth=3)
sns.kdeplot(virginica_sample, label="virginica", linewidth=3)
plt.legend()
plt.title(f"Distribution aproximation using a $sample\_size={sample_size}$")
plt.show()
In [22]:
# Both samples seem normal and obviously the populations differ but, anyway, let's do a t-test.
# First let's check whether the variances are equal or not using a two-tailed F-test
# H0 = There is no difference between variances var_1 = var_2 if p>0.05
# H1 = There is a difference between variances var_1 != var_2  if p<0.05

def Ftest(sample1:"array-like", sample2:"array-like") -> (float, float):
    
    '''
    F-test of the variances
    
    This function performs a two-tailed F-test to compare the variances of two populations, with the null
    hypothesis (H0) stating equal variances and the alternative hypothesis (H1) unequal variances
    
    Parameters
    ----------
    
    sample1: array-like
        Sample of the first population
        
    sample2: array-like
        Sample of the second population

    Return
    ------
        (float, float): F-statistic and p_value respectively
    
    Examples
    --------
    
    >>>import scipy.stats as stats
    >>>import numpy as np
    >>>sample_A = [1,2,3,4,5]
    >>>sample_B = [1,2,4,4,5]
    >>>Ftest(sample_A, sample_B)
    (1.08, 0.9423361401911694)
    '''
    
    # Variance of the samples
    sigma_1 = np.var(sample1, ddof=1)
    sigma_2 = np.var(sample2, ddof=1)
    
    numerator, denominator = (sigma_1, sigma_2) if sigma_1 > sigma_2 else (sigma_2, sigma_1)
    F = numerator / denominator

    p_value_one_tailed = 1 - stats.f.cdf(F, np.size(sample1) - 1, np.size(sample2) - 1)
    
    return F, p_value_one_tailed * 2
In [23]:
def ShowTypeVariances(p, alpha, statistic):
    
    print(f'{statistic[0]}: statistic = {statistic[1]}; p_value = {p}')
    
    if p < alpha:
        print("\tThe variances are not equal")
    else:
        print("\tThe variances are equal")

alpha = 0.05
F_statistic, p_value_F = Ftest(setosa_sample, virginica_sample)
ShowTypeVariances(p_value_F, alpha, ["F-test", F_statistic])
F-test: statistic = 3.391227706345914; p_value = 0.001539251598021929
	The variances are not equal
In [73]:
# We'll also run Levene and Bartlett tests using the same alpha

levene_statistic, p_value_l  = stats.levene(setosa_sample, virginica_sample)

# Bartlett is more sensitive to non-normal data. Only used if we are really confident that the
# data is normal
bartlett_statistic, p_value_b = stats.bartlett(setosa_sample, virginica_sample)

ShowTypeVariances(p_value_l, alpha, ["Levene-test", levene_statistic])
ShowTypeVariances(p_value_b, alpha, ["Bartlett-test", bartlett_statistic])
Levene-test: statistic = 4.564968517458503; p_value = 0.0368626178547197
	The variances are not equal
Bartlett-test: statistic = 10.027323487542784; p_value = 0.0015423499256511143
	The variances are not equal
In [25]:
# The variances are not equal, as you have already seen; therefore,
# we will use a t-test for unequal variances (Welch's test)
independent_ttest = stats.ttest_ind(setosa_sample, virginica_sample, equal_var=False)

if independent_ttest.pvalue < alpha:
    print("The means between setosa and virginica are statistically different (mu_1 != mu_2)")
else:
    print("The means between setosa and virginica are statistically equal (mu_1 = mu_2)")
The means between setosa and virginica are statistically different (mu_1 != mu_2)

An independent t-test with unequal variances (Welch's test) was performed for the two species. However, with stats.ttest_rel we can run a paired-sample t-test, and with stats.ttest_1samp a one-sample t-test; in both cases the alternative parameter lets us choose among a two-tailed, right-tailed, or left-tailed test.
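A minimal sketch of those variants, using made-up measurements (the sample values and the reference mean of 5.0 are only illustrative):

```python
import scipy.stats as stats

# Made-up sample; is its mean different from a reference value of 5.0?
sample = [5.1, 5.8, 6.2, 4.9, 5.5, 6.0, 5.3, 5.7]

# One-sample t-test, two-tailed by default
t_two, p_two = stats.ttest_1samp(sample, popmean=5.0)

# Same test, right-tailed: is the mean greater than 5.0?
t_right, p_right = stats.ttest_1samp(sample, popmean=5.0, alternative="greater")

# Paired-sample t-test: the same made-up subjects measured before and after
before = [10.1, 9.8, 11.2, 10.5, 9.9, 10.7]
after  = [11.0, 10.4, 12.1, 11.6, 10.6, 11.5]
t_rel, p_rel = stats.ttest_rel(before, after)

print(p_two, p_right, p_rel)
```

Since the t statistic is positive here, the right-tailed p-value is exactly half the two-tailed one.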

Z-test¶

The Z-test is another parametric test to compare means between groups. More formally, a common definition found online says "a z-test is a statistical test to determine whether two population means are different when the sample size is large and the variances are known". Like a t-test for comparing means, a Z-test serves the same purpose but assumes that we DO know the standard deviation of the population from which the sample came, and that the population follows a normal distribution. Similarly, a Z-test can be performed for one sample or for two independent samples, either two-tailed or one-tailed.

Two-sample

$$z = \frac{\overline{x_1} - \overline{x_2}}{ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$$$\sigma_1;\sigma_2=population\ standard\ deviations$$$$\overline{x_1};\overline{x_2}=sample\ means$$$$n_1;n_2=sample\ sizes$$

One-sample

$$Z=\frac{\overline{X}-\mu_{0}}{\sigma/\sqrt{n}}$$$$\overline{X}=mean\ of\ the\ sample$$$$\mu_{0}=mean\ to\ compare$$$$\sigma=population\ standard\ deviation$$$$n=sample\ size$$

In addition, a Z-test can be applied to two proportions

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})( \frac{1}{n_1} + \frac{1}{n_2})}}$$$$\hat{p}_1;\hat{p}_2 =proportions\ of\ samples$$$$\hat{p}=mean\ of\ both\ samples$$
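As a sketch, the two-proportion formula above can be coded directly; the counts (45 successes out of 200 versus 30 out of 190) are made up for illustration:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical counts: successes and sample sizes for two groups
x1, n1 = 45, 200
x2, n2 = 30, 190

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)        # pooled proportion (p-hat)

z = (p1_hat - p2_hat) / np.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
p_value = 2 * stats.norm.sf(abs(z))     # two-tailed p-value

print(f"z = {z:.3f}, p_value = {p_value:.3f}")
```

With these made-up counts the difference in proportions is not significant at alpha = 0.05.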

T-test vs Z-test¶

Both tests compare means for normal data, but which one should we use? One often reads that for sample sizes smaller than 30 a t-test is used almost exclusively, while for larger samples a Z-test should be your choice regardless of whether the population standard deviation is known. This makes sense in light of the fact that the t-distribution becomes more similar to the Z-distribution as the sample size increases (central limit theorem), and the population standard deviation can be approximated with the sample's (that is why in some notations of the Z-test you will find the difference of the means divided by the standard error of the sample instead of the population standard deviation, similar to what the one-sample t-test does). Strictly speaking, however, some point out that if your sample size is larger than 30 and you do NOT know the variance or standard deviation of the population, you would be making a mistake (not a big one, actually) by using a Z-test, since the statistic obtained carries a slight error.

As mentioned earlier, the t-distribution curve changes with the sample size, which is why degrees of freedom are defined for the t-test; they are not necessary for a Z-distribution. The lower the degrees of freedom, the wider the tails of the curve; as the degrees of freedom tend to infinity, the t-distribution converges to the Z-distribution (which has thinner tails).

Apparently, using a Z-test when the sample size is greater than 30 has a historical origin: not every statistician had access to tables of p-value probabilities for all degrees of freedom, as opposed to a single table for the Z-distribution (again, no degrees of freedom needed), and since both distributions are similar for large samples, the approximation was acceptable.

With that small reflection in mind: nowadays we have software with all those values already computed, so, in my personal understanding, regardless of the sample size, if we do not know the population standard deviation we should NOT use a Z-test but a t-test. On the other hand, if the calculations are done by hand, it may be convenient to use a Z-test just for ease, with a minimal error that can often be negligible.
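That convergence is easy to check numerically: the two-tailed 95% critical value of the t-distribution approaches the Z value (about 1.96) as the degrees of freedom grow.

```python
import scipy.stats as stats

z_crit = stats.norm.ppf(0.975)            # two-tailed 95% critical value, ~1.96

for df in (5, 10, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df=df)    # same alpha on the t-distribution
    print(f"df = {df:>4}: t = {t_crit:.4f}  (z = {z_crit:.4f})")
```

With 5 degrees of freedom the critical value is about 2.57; by df = 1000 it is already within 0.01 of 1.96.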

In [26]:
## Let's calculate a Z-test for the following problem found on the internet (from
# the YouTube channel statslectures)

# In the population, the average IQ is 100 with a standard deviation of 15. A team
# of scientists wants to test a new medication to see if it has either a positive or
# negative effect on intelligence, or no effect at all. A sample of 30 participants who
# have taken the medication has a mean of 140. Did the medication affect intelligence?

# Notes
# 1. This is a two-tailed problem because the exercise asks for either a negative or a positive
#    effect, not only a positive effect (right tail) or only a negative effect (left tail)
# 2. This is a one-sample Z-test because we are comparing a sample mean against a given value.
#    We are not comparing two samples or two populations
# 3. Let's set a confidence level of 95%, i.e. an alpha of 0.05
#
# H0: There is no difference between the participants' mean IQ (X = 140) and the population IQ (mu_0 = 100)
#       X = mu_0
# H1: There is a difference between the participants' mean IQ (X = 140) and the population IQ (mu_0 = 100)
#       X != mu_0

std_population = 15
population_mu = 100
sample_mean = 140
n = 30
alpha = 0.05

# One-sample Z statistic: difference of means over the standard error of the mean
Z_statistic = (sample_mean - population_mu) / (std_population / np.sqrt(n))

# Two-tailed p-value from the standard normal distribution
p_value_two_tail = 2 * stats.norm.sf(abs(Z_statistic))

if p_value_two_tail < alpha:
    print("The drug had an effect on the participants c: !")
else:
    print("The drug had no effect on the participants :c ")

print(f"Z = {Z_statistic:.2f}; p_value = {p_value_two_tail:.2g}")
The drug had an effect on the participants c: !
Z = 14.61; p_value = 2.6e-48
In [27]:
# On the other hand, here is an interactive graphic representation of the
# Z-distribution (standard normal distribution) and t-distribution using
# different degrees of freedom. It's easier to plot using a data frame

def t_distribution_standard(df, x):
    return stats.t.pdf(x=x, df=df, loc=0, scale=1)

# Range of values to retrieve density of Z-distribution and T-distribution
x_values = np.arange(-5, 5.025, 0.025)

# Normal data for plotting
normal_distribution_plot = stats.norm.pdf(x=x_values, loc=0, scale=1)

t_distributions = np.empty((x_values.shape[0], 3))
df_t_distributions = pd.DataFrame()

# Retrieving the density of t-distributions based on the degrees
for degree in range(1, 31):

    t_distributions[:,0] = degree
    t_distributions[:,1] = x_values.copy()
    t_distributions[:,2] = t_distribution_standard(degree, x_values)
    
    df_t_distributions = pd.concat([df_t_distributions, pd.DataFrame(t_distributions)])
    
df_t_distributions.rename({0:"Degree", 1:"Std", 2:"Density"}, inplace=True, axis=1)
df_t_distributions["Degree"] = df_t_distributions["Degree"].transform(int)
In [28]:
fig = px.line(df_t_distributions, x="Std", y="Density",
                 animation_frame="Degree", range_y=[0,0.45],
                 template="plotly_dark")

fig.update_traces(line=dict(width=3), name="T-distribution", showlegend=True)
fig.update_layout(title_text='T-distributions vs Z-distribution', title_x=0.45)

fig.add_scatter(x=x_values, y=normal_distribution_plot,
                line=dict(width=3,
                          color="orange",
                          dash="dash"),
                name="Normal (Z-distribution)",
                hovertemplate='Std=%{x}<br>Density=%{y}'
               )

fig.update_layout(
    margin=dict(l=30, r=150, t=50, b=10),
)


#fig["layout"].pop("updatemenus")
fig.show()

Anova¶

For a detailed description of this statistical test, see ANOVA análisis de varianza para comparar múltiples medias by Joaquín Amat Rodrigo, from which the explanations below are taken.

In short, ANOVA is a set of statistical tests used to compare the means of two or more groups. Similarly to a t-test, ANOVA comes in three main types of analysis

  • One-way ANOVA, to compare the means of more than two groups with one factor and one response variable (for example, the response to a concentration-improving drug (quantitative) across different ages [children, adults, or the elderly])

  • ANOVA for multiple factors and one response (for example, the response to a concentration-improving drug (quantitative) across sexes [male or female] and different ages [children, adults, or the elderly])

  • ANOVA for dependent, paired, or repeated-measures data (for example, the response to a concentration-improving drug (quantitative) before and after it has been taken, across different ages [children, adults, or the elderly])

In general, these tests share some assumptions

  • Independence: the observations must be random and the groups or factors independent of each other (this assumption does not hold for repeated-measures ANOVA)

  • Homoscedasticity: the variance within the groups must be approximately equal across all of them (this assumption does not hold for heteroscedastic ANOVA, which applies Welch's correction, also known as the Welch test)

  • Normal distribution within each group

On the other hand, each of the ANOVA tests mentioned above may have fewer or additional assumptions.

The statistic obtained from an ANOVA test is the F ratio, which follows a Fisher–Snedecor F-distribution. It is computed under the null hypothesis that the means of all the groups studied are the same, i.e. there is no difference, also denoted as $\mu_1 = \mu_2 = \mu_3 = ...... = \mu_n$; the alternative hypothesis, by contrast, states that the means of at least two groups differ significantly. Note that an ANOVA test does not tell us which of those groups turned out to have different means. For that, post-hoc methods (methods applied after rejecting the null hypothesis) such as the Tukey-Kramer Honest Significant Difference (HSD), or pairwise t-test comparisons with corrections to avoid inflating the type I error such as the Holm–Bonferroni Adjustment, are used.
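As a sketch of the post-hoc step, SciPy (version 1.8 or later) provides stats.tukey_hsd, which can be run after a significant ANOVA; the three groups below are made up (they reuse the vectors from the R one-way ANOVA example further down):

```python
import scipy.stats as stats

# Made-up groups (the same vectors as the R one-way ANOVA example)
group_a = [13, 11, 8, 11, 8]
group_b = [13, 11, 14, 14, 10]
group_c = [4, 1, 3, 4, 2]

# One-way ANOVA first: is at least one mean different?
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_anova:.2g}")

# If the ANOVA is significant, a Tukey-Kramer HSD tells us which pairs differ
if p_anova < 0.05:
    print(stats.tukey_hsd(group_a, group_b, group_c))
```

The ANOVA only says that some means differ; the Tukey-Kramer table then reports a statistic and p-value for every pair of groups.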

Meanwhile, a statistic that accompanies an ANOVA test is the effect size, denoted $\eta^2$. It measures how much of the variance in the quantitative dependent variable results from the influence of the qualitative independent variable, that is, how much the independent variable (factor) affects the dependent variable, and it is calculated as:

$$\eta^2 = \frac{SS_{effect}}{SS_{total}}$$$$SS_{effect}= The\ sum\ of\ squares\ of\ an\ effect\ for\ one\ variable$$$$SS_{total}= The\ total\ sum\ of\ squares\ in\ the\ ANOVA\ model$$

The most commonly used classification levels for the effect size are:

$$0.01 = small$$$$0.06 = medium$$$$0.14 = large$$
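The effect size above can be sketched directly from the sums of squares; the groups are made up (again reusing the values from the R example below):

```python
import numpy as np

def eta_squared(*groups):
    """Effect size of a one-way ANOVA: SS_effect / SS_total."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()

    # SS_effect: variation of each group mean around the grand mean
    ss_effect = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    # SS_total: variation of every observation around the grand mean
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_effect / ss_total

# Made-up groups; the result is about 0.87, a large effect under the levels above
print(eta_squared([13, 11, 8, 11, 8], [13, 11, 14, 14, 10], [4, 1, 3, 4, 2]))
```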
One-way Anova¶
In [74]:
%%R

data_1 <- c(13,11,8,11,8)
data_2 <- c(13,11,14,14,10)
data_3 <- c(4,1,3,4,2)

data_df_1 <- data.frame(data=data_1)
data_df_1$Groups <- "A"

data_df_2 <- data.frame(data=data_2)
data_df_2$Groups <- "B"

data_df_3 <- data.frame(data=data_3)
data_df_3$Groups <- "C"

data <- rbind(data_df_1, data_df_2, data_df_3)

anova <- aov(data ~ Groups, data)
summary(anova)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Groups       2 252.93  126.47   39.11 5.53e-06 ***
Residuals   12  38.80    3.23                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In [72]:
%%R

plot(anova)

Ancova, Manova and Mancova¶

Non-parametric¶

In [ ]: